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Abstract 

The inference of gene regulatory network (GRN) from gene expression data is an unsolved problem of great importance. 
This inference has been stated, though not proven, to be underdetermined implying that there could be many equivalent 
(indistinguishable) solutions. IVlotivated by this fundamental limitation, we have developed new framework and algorithm, 
called TRaCE, for the ensemble inference of GRNs. The ensemble corresponds to the inherent uncertainty associated with 
discriminating direct and indirect gene regulations from steady-state data of gene knock-out (KO) experiments. We applied 
TRaCE to analyze the inferability of random GRNs and the GRNs of £ coli and yeast from single- and double-gene KO 
experiments. The results showed that, with the exception of networks with very few edges, GRNs are typically not inferable 
even when the data are ideal (unbiased and noise-free). Finally, we compared the performance of TRaCE with top 
performing methods of DREAI\/14 in silico network inference challenge. 
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Introduction 

The discovery and analysis of biological networks have 
important applications, from finding treatment of diseases to 
engineering of microbes for the production of drugs and biofuels 
[1-4]. With continued advances in high throughput and omics 
technology, the inference of biological networks from omics data 
has received a great deal of interest. In particular, the inference of 
gene regulatory networks from gene expression data constitutes a 
major research topic in systems biology. In the last decade, the 
number of methodologies that are dedicated for the GRN 
inference has increased tremendously [5-7]. 

The Dialogue for Reverse Engineering Assessments and 
Methods (DREAM) project is a community-wide effort initiated 
to fulfill the need for a rigorous and fair comparison of the 
strengths and weaknesses of methods for the reverse engineering of 
biological networks from data. To this end, challenges involving 
the inference of cellular networks are organized on a yearly basis 
(http://www.the-dream-project.org/challenges). Specifically, the 
inference of GRN has become a major focus of several DREAM 
challenges. The outcomes of these challenges indicate that the 
state-of-the-art algorithms for GRN inference are unable to 
provide accurate and reliable network predictions, even when 
large expression datasets are available and the number of genes is 
small (10-100 genes) [8-10]. Nonetheless, a crowd-sourcing 
strategy that combines the predictions of different inference 
methods has been shown to be more reliable than any individual 
method [10]. 

Whether or not a direct regulation of one gene by another can 
be correctly inferred depends not only on the ability of an 
inference method to extract the relevant information from data. 



but also on the availability of such information in the data. In 
general, the information content of data is determined first and 
foremost by the conditions of the experiments. If the required 
information is unavailable or incompletely available, then the 
inference problem is underdetermined. In such a case, the network 
is not inferable from the data regardless of the method used. 

The underdetermined nature is not exclusive to the inference of 
GRNs. Much of the difficulty in the inverse modeling of signaling 
and metabolic networks can also be attributed to the lack of 
inferability or identifiability of model structure and parameters 
[1 1-14]. As the inference problem is underdetermined, there exist 
multiple solutions which are indistinguishable. The lack of model 
identifiability has motivated a paradigm shift toward ensemble 
modefing [15-18]. While such a strategy has begim to gain 
traction in the modeling of signaling and metabolic networks, the 
ensemble paradigm has not been widely used in the inference of 
GRNs. Also, since the network representation and data for GRNs 
differ markedly from those for signaling and metabolic networks, 
existing algorithms for ensemble modeling cannot be direcdy 
applied for the inference of GRN. 

In this work, we introduce new framework and algorithms, 
called Transitive Reduction and Closure Ensemble (TRaCE), for 
the ensemble inference of GRNs. Specifically, TRaCE produces 
the lower and upper bounds of the ensemble, i.e. the smallest 
network and the largest network that limit the complexity of 
networks in the ensemble. As the size of the ensemble reflects the 
uncertainty about the GRN inference, the bounds can also be used 
to analyze the inferability of GRNs. In this study, we have used 
TRaCE in two applications. First, we investigated the inferability 
of random GRNs and the GRNs of £. coli and S. cereviseae given 
steady-state gene expression data of single- and double-gene KO 
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experiments. Then, we applied TRaCE to simulated gene 
expression data, generated in the same manner as the DREAM 
4 in silico network inference challenge, and compared the 
performance of TRaCE with existing methods. 

Methods 

Theoretical Foundation 

Definitions. Here, we provide a short synopsis of basic 
concepts in graph theory that are necessary for the development of 
our algorithm. A graph G is an ordered pair (V(G), E(G)), where 
V( G) is the set of vertices (or nodes) and E{G) is the set of edges. 
The number of vertices n and tin; number of edges m are called 
the order and the size of the graph, respectively. An edge e is 
defined by the pair (iJ)eE(G) a V(G) X V{G), describing 
the existence of a relationship between two vertices / and j. In this 
case, the edge e= ( ij) is said to be inrAdent to the vertices / and j. 
The set of edges of a graph G that are incident to a vertex / is 
denoted by EaiOt while the cardinality of Ecii) is called the 
degree of the vertex Similarly, the set of edges that are incident 
to a set of vertices Za V(G) is denoted by Eq{Z) = {JiezEoii). A 
graph ^ is a subgraph of B, denoted by A<^B, if V{A) <= V{B) 
and E(A)<^E(B). In this case, B is called the supergraph oiA, and 
is also said to contain A . Furthermore, the union of two graphs A 
and B is denoted by AiJB, where E{A[JB)=E{A)\JE{B) and 
V{AiJB)= V{A)UV{B). The intersection of two graphs A and 
B is denoted by Ar\B, where E{A(^B)=E{A)r\E{B) and 
V(Ar]B)=V(A)r]V{B). Finally, the diflFerence between two 
graphs A and B with V(A) = V(B) is denoted by A — B and 
defined as the set of edges in the graph A that do not belong to the 
graph B, i.e. edges in the set dilference E{A) — E(B). 

A directed edge is an ordered pair {ij), representing an edge 
from the vertex / pointing to the vertex J. A directed graph or 
digraph G is a graph in which all of its edges are directed. A 
directed path is a sequence of vertices such that there exists a 
directed edge from one vertex to the next vertex in the graph. The 
first vertex in a directed path is called the start vertex, and the last 
is called the end vertex. A directed cycle is a directed path where the 
start and the end vertices are the same. A directed acyclic graph 
(DAG) is a digraph which does not contain any cycle. The 
adjacency matrix of a digraph G of order n, denoted by Adj( G), is 
an « X « matrix with Adjj j = 1 when {iJ)eE{G), and Adjij = 0 
otherwise. In other words, the non-zero elements of the adjacency 
matrix represent all directed edges from any node / to another 
node j in the graph G. Meanwhile, the accessibility matrix of G, 
denoted by Acc(G), is an n x n matrix with AcCij = \ when 
there exists a directed path from node / to node j, and AcCj j = 0 
otherwise. When AcCij = \, vertex j is said to be accessible from 
the vertex i. 

A strongly connected component or strong component of a 
digraph G is a maximal subset of nodes in G where any two nodes 
in the subset are mutually accessible. Every pair of nodes that are 
part of a directed cycle belong to the same strong component, 
while any node that is not part of a cycle is a strong component of 
its own. The condensation of a digraph G is the DAG of the strong 
components of G, which is generated by lumping the nodes 
belonging to a cycle into a single node and replicating the edges 
that are incident to any of these nodes onto the lumped node [19]. 

A digraph is transitive if for every pair of vertices and j, there 
exists an edge (jj) when there is a directed path from i to j. The 
transitive closure of a digraph G, denoted by G, is the smallest 
transitive supergraph of G (with the fewest edges) [20] . When G is 
a DAG, we denote the transitive closure of G as G^ . As shown in 
Fig. l(a)-(b), the transitive closure of a digraph can be generated by 



adding a directed edge {ij), whenever a directed path exists from 
vertex i to vertex / Note that the accessibility matrix of a digraph 
is the adjacency matrix of its transitive closure, i.e. 
Acc{G) =Adj{(j). For a digraph G, the set of digraphs that have 
the same transitive closure G is denoted by S{G) = {Gi : Gi = G}. 
The transitive reduction of G, denoted by G', is defmed as the 
smallest member of S{G) in size (i.e. the graph with the fewest 
edges). The transitive reduction of a DAG is unique, given by 
G' = r\Qi^S(G)Gi [20]. An algorithm for obtaining transitive 
reduction has been previously developed [19], in which any 
directed edge {ij) is pruned whenever there exists a directed path 
from i to j that does not include {ij) (for example, see Fig. 1(c)). 
Note that the transitive reduction of a digraph with cycles is not 
unique. 

Inference of Network Ensemble Bounds. In this work, we 
consider the inference of GRNs as digraphs, where the nodes 
correspond to the genes and the directed edges represent the gene 
regulatory interactions. An edge (->/ implies that the expression of 

gene ; influences the expression of gene j. In the following, the 
GRN of interest is denoted by the digraph G^. For any set of genes 
Vko V{G^), we use Gvi^o to denote a subgraph of G^ that results 
from removing all edges incident to the genes in the set Vko> 
EGf{VKo)- In other words, Gv^o is the digraph with 
V{GvKo)'=V{Ga) and E(Gv^o)= E{G^) - Eg^{Vko)- For ex- 
ample, G{5} associated with G^ in Fig. 1(a) is the graph with all 
edges incident to gene B removed, as shown in Fig. 1(d). Here, we 
interchangeably use the notations for a graph G and its adjacency 
matrix Adj{G). 

Gene KO experiments are commonly performed for the 
purpose of GRN inference. In these experiments, the resulting 
data typically consist of gene expression profiles taken after the 
effects of the gene perturbation have reached steady state. While 
temporcil gene expressions are increasingly measured, here we 
focus on using more commonly available steady-state expression 
data. The treatments of time-series measurements and observa- 
tional data are left to future publications. Many network inference 
algorithms have been developed for using data of gene KOs [7,9], 
and most of these algorithms produce a single network prediction. 
In contrast, an ensemble inference strategy is adopted in this work. 

In order to illustrate the limitation of using steady-state gene 
expression data for GRN inference, we consider a GRN G^ 
described by the graph in Fig. 1(a). Here, KO of gene A is 
expected to cause changes in the expression of genes B, C, D and 
E at steady state, even though A direcdy regulates only B and D. 
This simple illustration demonstrates that we cannot in principle 
discriminate direct and indirect gene regulations from steady-state 
gene KO expression data [19]. In general, genes that are 
differentially expressed upon knocking out gene / in the GRN 
correspond to those that are directiy and indirectiy regulated by 
gene /, i.e. vertices in G^ that are accessible from the vertex 
Motivated by such a limitation, in TRaCE we first convert gene 
KO data into gene accessibility lists or matrices. As the minimum 
input, TRaCE requires the complete dataset of single-gene KO 
experiments, from which one can construct the accessibility matrix 
Gjj. More specifically, the y'-th element in the /-th row of Gp (i.e. 
(G^), y) is set to 1 when gene j is differentially expressed in the KO 
experiment of gene /. The other elements of G^ are set to 0. The 
detailed procedure of differential expression analysis adopted in 
this work is described in the Numerical Implementation section. 

For data of multi-gene KO experiments, we consider the 
accessibility matrix of Gy^g for an appropriately chosen set of 
genes V/co- In principle, we can determine Gy^g from the 
complete set of experiments involving KOs of the genes in the set 
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(e) <5{s} (f) G' 



Figure 1. (a) An example of a directed grapli Gjj. (b) The transitive closure Gff (in this case, G^ = G^ since is a DAG), (c) The transitive 
reduction GJ, of G^. (d) The directed graph G{g} associated with Gjj. (e) The transitive closure G{b). In this case, the transitive reduction Gjg} happens 
to be the graph G^B}■ (f) The ensemble upper bound G^ obtained from G^ and G{b}. The ensemble lower bound G* obtained from G^ and Gj^j 
happens to be the graph Gjj. 
doi:l 0.1 371/journal.pone.01 0381 2.g001 



Vko and an additional gene / for all ie{V(Gfi)—VKo}- These 
experiments are equivalent to performing single-gene KOs of the 
GRN Gv^o> and therefore Gy^g can be obtained by following the 
same procedure as that for G^ above. As an illustration, consider 
the GRN in Fig. 1(a) with Vko = {-S}. The graph G{5} is given in 
Fig. 1(d). In this case, we can construct the accessibility matrix 
G{5} from the data of two-gene KO experiments, namely {A,B}, 
{C,B}, {D,B} and {E,B} KOs. As these experiments differ from 
each other in only one gene while sharing the KO of gene B, the 
differential expression analysis of the data thus correspond to 
changes in the expression of the GRN Gj^} caused by a single- 
gene KO. Consequently, in this analysis, genes that are found to 
be differentially expressed in the KO of {i,B} are those that are 
accessible from gene / (ie{A,C,D,E}) in the graph Gjbj. For 
example, the KO of {A,B} is expected to cause differential 
expression in genes D and E. The fuU accessibility matrix of G{^} 
is illustrated by the digraph in Fig. 1(e). 

We can generalize the simple example above to any set of genes 
Vko that could be derived from the available multi-gene KO 
experiments. More specifically, we set iGy^g)jj to 1 when 
knocking-out {i,VKo} leads to a differential expression of gene j 
with respect to its expression level in Gy^^. The remaining 
elements of Gy^g are set to 0. Unfortunately, the construction of 
Gkxo of a large GRN Gp would proportionally require a high 
number of KO experiments (the number of KO experiments is 
nexp = n — nyj^g, where n and riy^g is the number of genes in G^ 
and Vko, respectively). However, when Gjj is sparse, Gy^g differs 



from G^ for only a few elements and importantly, these elements 
can be determined from Gjj (see the next section). 

In the theoretical development below, we assume that the 
accessibility matrices Gp and Gyk^^ for A:= 1,2, . . . ,K, have already 
been obtained from the expression data. Here, K denotes the total 
number of accessibility matrices involving subgraphs of the GRN 
G^ that can be constructed from data. For example, from the 
dataset of the complete double-gene KO experiments, we can 
obtain the accessibility matrix Gyk^^ = G{jt} for fc= 1,2, . . . ,n (here, 
K = n). In TRaCE, we consider the ensemble containing all 
digraphs that are consistent with the accessibility matrices G^ and 
Gyk 's, which is the set: 

' KO 

^=/g* : G* = G„ and G> =G^k ,k=\,l,---,K\ (1) 

I. KO ^KO J 

where G*^. is the digraph with K(G*^ )=V(G*) and 

^KO ^KO 
E{G\ ) = E{G*)-Eg*{V'^o)- Note that the GRN G^ is a 

^KO 

member of the ensemble A. The size of the ensemble is a direct 
measure of uncertainty in the network inference problem. A GRN 
is therefore deemed inferable when the ensemble only contains a 
single (unique) network. 

As the number of digraphs in the ensemble is often very large, in 
TRaCE we generate only the lower and upper bounds of the 
ensemble, denoted by G^ and G^ , respectively. The bounds are 
defined such that each digraph in the ensemble is a supergraph of 
G^ and a subgraph of G^ . For GRNs without any cycle (i.e. 
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DAGs), the lower and upper bound GRNs can be obtained from 
the accessibility matrices of and Gj/s^'s (i.e. and G^;. 's) 

and their transitive reductions (i.e. GL and G' 's), using the 

following equations (for details see the Numerical Implementation 
section): 

G' = G',u(\J G;,, ) (2) 

\/c=l KoJ 

= Glr, n (g^^^ U£e^r( Kio)) j (3) 

where G^^, U^grC^^o) denotes the digraph with vertices 
V(Gyk ) and edges E(Gyk )UE^t(V'^o)- Without any Gl,, , 

the upper bound of the ensemble is simply given by the 
accessibility matrix G^ and the lower bound is the transitive 
reduction Gg. As G^,^ is a subgraph of Ga, the transitive closure 

G^l^ is also a subgraph of G^. In Eq. (3), the upper bound is 

constructed starting from Gj in which edges are removed based 
on G^i^ . Here, edges incident to V^q are not altered during the 

intersection of the accessibility matrix G^^, . For example, 

consider again the GRN in Fig. 1 (a) with the accessibility matrices 

G^ and G|gj in Figs. 1(b) and 1(e). The resulting upper bound G^ 

from the combination of these accessibility matrices will have one 

fewer edge than G^, which is the edge (A,C) (see Fig. 1(f)). Thus, 

the size of the upper bound is generally reduced with the 

incorporation of G^^ 's. On the other hand, according to Eq. (2), 
^KO 

the lower bound becomes larger with the inclusion of every 
available G' ^. . In the same example above, the transitive 

reduction of G|gj happens to be the graph Gj^j (i.e. in this case 
G'[B] ~ G{B])- Here, the combination of G^ and Gj^j in Figs. 1(c) 
and 1(d), respectively, gives the lower bound G^ that is equal to 
Gfi. However, in general, G^ is a subgraph of G^. 

Theorem 1 establishes G^ and G^ in Eqs. (2)-(3) as valid lower 
and upper bounds of the set A for DAGs. 

Theorem 1: For G^ and G^ described in Eqs. (2)-(3), the 
following relationship applies: 

G^'^G^G" V GeA 

Proof of G^^G'iGeA : For any edge eeE(G^), Eq. (2) 
implies that either eeE^G'^ or eeE^G'y,, ^. Therefore, we have 
either: 

• ee£-f G'^) =E{G') S^CG) V GeA, or 

• ee£(G;.,J=£((G-£G(K*o))')e£(G) VGeA □ 

Proof of G^G" \/ GeA: If eeE{G) for some GeA, then 
eeE[G^) = E(Ga)- In addition, this edge satisfies either: 



eeE(G - Eo{ K*^)) £ £ ( (G - £c;( ^lo)) ^) = £ {^Ik^^ ^ or 
eeEai F^o) E E^r ( K^o) = E^t ( F^o). 



Therefore, eeE(G"). □ 

Remark: Since G^ is a member of yl, G^ and G^ can also be 
thought as the lower and upper bounds of G^, i.e. G^ S G^ £ G^. 
For DAGs, the members of the set A can be obtained by 
combinatoriaUy adding edges in the set (G'^ — G^) to G^. 
Therefore, the dimension of A is equal to 2^ where A'^ is the 
difference between the number of edges in G^ and G^. Finally, 
Theorem 1 guarantees that when G^ = G^, Gfi is fuUy identifiable, 
i.e. G^ = G" = G^. 

For digraphs with cycles, the upper bound can still be 
constructed using Eq. 3 with G replacing G^ . In this more 
general case, the relationship G^ a G^ in Theorem 1 is stiU valid. 
However, as mentioned earlier, the transitive reduction of 
digraphs with cycles is not unique. In a previous publication 
[19], Wagner proposed a procedure in which digraphs are first 
condensed into DAGs before constructing the transitive reduction 
[19]. Similarly, in TRaCE, each input accessibility matrix is first 
condensed and the transitive reduction algorithm is subsequendy 
apphed to the DAG of the strong components. Here, edges 
incident to the condensations of cycles are also removed. 
Afterwards, the transitive reduction graph is expanded, reversing 
the condensation step. Except for cycles involving two nodes, 
edges of any directed cycle cannot be uniquely prescribed and are 
therefore pruned. The above procedure for reducing digraphs with 
cycles is referred to as Condensation, Transitive Reduction and 
Expansion (ConTREx). The ConTREx of an accessibility matrix 
G, denoted by G, may no longer be a valid transitive reduction (i.e. 
G^ may not necessarily be equal to G). Nonetheless, the lower 
bound constructed using Eq. (2) with G's replacing G''s, satisfies 
G^^Gjj. The proof of this relationship is analogous to the one 
presented for Theorem 1 . However, G^ may not be a member of 
A (see Fig. SI). Finally, the enumeration of digraphs with cycles 
from G^ and G^ is more complicated than that for DAGs. The 
main difference is in the generation of all possible cycles among 
nodes belonging to a particular strong component, constrained by 
G^ and G^ (see an example in Fig. S2). 

Error Correction and Filter. In practice, the accessibility 
matrices constructed from data contain errors. Some elements of 
the accessibility matrices maybe identified as 1 when they should 
be 0 (i.e. false positive, FP), and some maybe identified as 0 when 
they should be 1 (i.e. false negative, FN). These errors can affect 
the lower and upper bound constructed by Eqs. (2)-(3). We denote 
the erroneous lower and upper bound by G^ and G^ , respectively. 
In this case, neither G^ is guaranteed to be a subgraph of G^, nor 
G^ a supergraph of Gfi and G^. 

There are several types of errors affecting G^ and G^. In the 
first case (Type A error), an edge which is not present in Gfj 
(e^E(Gif)) erroneously appears in G^ and G^ (eeE(G^f^G^)). 
Or, an edge in G^ {eeE{Gfi)) is missing from both G^ and G^ 
(e^E(G^yjG^)). As such error affects both G^ and G^ in the 
same manner, this error is not detectable from G^ and G^ . The 
second case (Type B error) involves either a FP in the accessibility 
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Table 1. Types of Errors In and . 



Error 

Type A Type B Type C 

Adj(Gf).. 0 1 0 or 1 0 or 1 

Adj{G")-j 10 10 

Adj{G'').. 10 0 1 



doi:l 0.1 371 /journal.pone.Ol 0381 2.t001 

matrix or a FN in tlie ConTREx matrix. In this case, the resulting 
bounds are stUl consistent with each other and are still valid for 
Gfi. However, the ensemble size and the network uncertainty 
increase due to this error. In the third case (Type C error), an edge 
erroneously appears only in G^, or vice versa, an edge is 
errorneously missing only from G" {eeE(G^)r]E(G"Y), where c 
denote the complement of a set). Here, the bounds become 
inconsistent with each other (i.e., G^^G^). Thus, we refer such 
errors as inconsistent edges, which can be identified by searching 
for edges belonging to G^ that are not in G^ (i.e., from G^ — G^). 
Table 1 illustrates the three types of errors mentioned above. 

A closer scrutiny of Eqs. (2)-(3) reveals that errors from the input 
accessibility matrices are passed on and compounded in the 
bounds. For example, FN errors in or any Gyk^^ will end up in 

G^ , whUe FPs in G^ and any Gyk will also appear in G^. In 
order to reduce the transmission of errors, we have developed a 
filter such that only a subset of edges of Gj,* are used for the 

construction of G^ and G^ . The filter is based on the concept of 
testable edges. Specifically, the testable edges of Gy^^ are any edge 
{iJ)eE{G^) with ij^ ^koi such that there exists a directed path 
from / to j involving one or more genes in V^q. As directed paths 
involving genes in V^q are disconnected in Gj/^co, we can 
potentially verify the existence of the testable edges from Gyt 
and Gyk . For example, the existence of the edge {A,D) in 

Fig. 1(a) can be verified using the transitive reduction of G{£}, 
which is the graph shown in Fig. 1(d). Meanwhile, we can establish 
the absence of the edge (A,C) in Gp using the accessibility matrix 
G{5} (see Fig. 1(e)). The number of testable edges can also be used 
to estimate the information content of a Gy^.^, where a higher 
number of testable edges indicates more informative Gy^g. 

For a given V^o, it is straightforward to show that the testable 
edges are the non-zero entries of the testability matrix: 

Tvko= E (Gh)'®(G>,),-, (4) 

where (G^)' and (G^), are the f-th column and ;-th row of G^, 
respectively, and (x) denotes the outer product. During the 
construction of the lower and upper bound GRNs, the incorpo- 
ration of each Gyk will only need to be performed for the 
KO 

associated testable edges, i.e. non-zero elements of T . 

KO 

Moreover, as the number of testable edges corresponding to 
Gyk is typically small and as such edges can be determined from 

KO 

Ga, only a few rows of Gyk need to be determined from data, i.e. 
KO 

rows of Tyk with non-zero entries. Thus, the number of KO 
KO 



experiments for constructing Gyk could be relaxed by consid- 

KO 

ering only testable edges. 

Numerical Implementation 

The pseudo-codes and the MATLAB implementations of 
TRaCE are provided in the supporting material and the following 
website (http://www.cabsel.ethz.ch/tools/trace). Given steady- 
state gene expression data, we first group the data into datasets 
according to the KO experiments required for the construction of 
the accessibility matrices. We perform differential expression 
analysis for each dataset using Z-score transformation and obtain 
the corresponding accessibility matrix. We provide two imple- 
mentations of TRaCE, one with and another without error 
correction. TRaCE without error correction should only be 
applied when the input accessibility matrices are error free (e.g. for 
inferability analysis). In any other scenario, TRaCE with error 
correction should be used. If desired, a ranked list of gene 
regulatory predictions can also be generated using the lower and 
upper bounds of the ensemble and the differential expression 
analysis. 

Constructing Accessibility Matrices from Expression 
Data. In the case studies, we have employed the Z-score 
transformation for differential gene expression analysis [21]. 
Without loss of generality, we describe below the procedure for 
constructing Gyf-g from the complete set of single-gene KOs of 
Gy^-g, i.e. all possible combinations of {/, Vxo} genes KOs. The 
gene expression dataset is organized into a matrix in which the 
rows correspond to the experiments and the columns correspond 
to the genes. Technical replicates are arranged into separate data 
matrices. For microarray data, the gene expression is typically 
represented by log- 1 0 transformed fluorescence intensity data. The 
following procedure is also illustrated in Fig. 2. 

1 . We first obtain the sample mean fij and standard deviation Sj of 
the expression of each gene j in the dataset. More specifically, 
for each technical replicate, we calculate the sample mean and 
standard deviation of the j'-th column in the data matrix. Then, 
we identify expressions that differ from the mean by more than 
a specified multiple Zaiwff of the standard deviation. We 
subsequently recompute the sample mean and standard 
deviation and Sj by excluding the data beyond z cutoff ■ When 
available, we also use the expression data from the KO 
experiment of genes Vko in calculating fij and S/ . 

2. For each replicate, we compute a z-score matrix z, y for 
i,jeV(G^) according to [21] 

z„ = V (5) 
[O -iijeVKo 
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Figure 2. Construction of accessibility matrix Gv^o from expression data. The data come from KOs of genes in the set Vko and an additional 
gene i, VfeF(Gfi) — Vko- For each replicate, the expression data are arranged into a matrix where the rows correspond to the experiments and the 
columns correspond to the genes. (1) The sample mean and standard deviation of the expression of gene j, denoted by fij and sj, respectively, are 
obtained using the expression data in the y'-th column of the data matrix. (2) For each replicate, a z-score matrix is computed according to Eq. (5). (3) 
Subsequently, the z-score matrices are averaged over the technical replicates to give ^Gv^^ - (4) The accessibility matrix Gv^o determined from 
Zg,,^,^ based on a threshold criterion in Eq. (6). 
doi:1 0.1 371/journal.pone.01 0381 2.g002 



where is the expression level of gene j associated with 
knocking out gene / and genes in Vko- These z-scores reflect 
the significance of changes in the gene expression with respect 
to the GRN Gv^^q- 

3. Subsequently, we average the z-score matrices over the 
technical replicates, producing the overall z-score matrix 

4. We determine the accessibility matrix Gy^o from '^Gy^^g using a 
threshold, as follows: 



fl if ^"^Gy^^ ) . . > ^threshoU 

0 if(ZG„ )''<z„, 
' K.O U 



(6) 



In our c-xpc-ricnce, Zcuoff = 3 and Zihreshold = 2 provide reliable 
network ensembles. In general, choosing higher Zcutoff and ZthreshoU 
wfll lead to fewer FPs but more FNs in the accessibility matrix. For the 
GRN examples considered in this work, the performance of TRaCE 
does not vary considerably within the selected ranges of z,i,reshold 
between 1.5 and 2.5 and Zcutoff between 2 and 3 (see Results). 
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Figure 3. Schematic diagrams of TRaCE witKi and without error correction, (a) Construction of the lower bound and upper bound G'^ 
from G(3 and Gyi 's using TRaCE without error correction. Expression data from gene KO experiments are first converted into accessibility matrices. 
ConTREx is then applied to each accessibility matrix, removing feed-forward edges and edges incident to vertices belonging cycles with more than 2 
nodes. The upper bound is constructed by taking the intersection of the accessibility matrices, while the lower bound is constructed by taking the 
union of the ConTREx outputs, (b) Construction of the lower bound and upper bound G^ from and Gyi 's using TRaCE with error correction. 
Expression data from gene KO experiments are converted into accessibility matrices Gi' and GfJ 's, where the superscript M indicates that these 
matrices may not be transitive due to noise in the measured gene expression levels. Subsequently, the transitive closures of Gj' and G'Jf, 's are 
created, denoted respectively by Gjj and Gyk 's, and the ConTREx of these closures are evaluated. TRaCE with error correction begins with the 

preprocessing of G'"'s and G's to produce the corrected matrices (g^^ and (G^), which are required to determine testable edges. For the 

construction of the lower and upper bounds, the union and intersection of matrices are performed with filtering, denoted by ij and n, respectively, 
where only the relevant testable edges are updated. Two candidate upper bounds are obtained, the first from the matrices G'''''s, denoted by Gf, 
and the second from the matrices G's, denoted by G,'. Meanwhile, the initial lower bound estimate, denoted by Gf, is obtained from the ConTREx 
matrices. The consistency check (CC) is first applied to the pair Gf and Gf to produce the corrected lower bound Gf , and then to the pair Gj and 



supporting material (text SI and text S2). 
doi:1 0.1 371/journal.pone.01 0381 2.g003 



TRaCE without error correction. TRaCE without error 
correction is implemented as matrix-operations of Eqs. (2) and (3). 
Briefly, the upper bound is constructed by performing 

Hadamard (element wise) multiplications of the accessibility 
matrices, excluding the rows and columns corresponding to genes 
in V'^g. On the other hand, the transitive reduction is based on the 
algorithm by Wagner [19], which has been re-implemented using 
matrix operations. When there is no cycle in Gij and Gyk 's, the 



transitive reduction algorithm is applied to each accessibility 
matrbc and the construction of G^ is done by binary additions of 
the transitive reductions, following Eq. (2). Cycles and genes 
involved in cycles can be detected from entries of 
Acc(Gii) X AcciGfif [22]. For GRNs with cycles, the Con- 
TREx procedure is apphed to each available accessibility matrix, 
and the resulting G matrices are again combined using binary 
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Figure 4. Ensemble inference and inferability of 10,000 random 
networks of order (a) n = 10 and (b) n = 100 genes. The mean 
network distances of the lower and upper bounds from are shown 
as a function of network size (i.e. number of edges). The error bars 
indicate the standard deviations. 
doi:1 0.1 371/journal.pone.01 0381 2.g004 

additions to produce G^. The schematic diagram of the error-free 
implementation is shown in Fig. 3(a). 

TRaCE with error correction. The procedure for TRaCE 
with error correction is illustrated in Fig. 3(b). There are two main 
steps in this procedure: (1) the construction of lower and upper 
bounds with filtering and (2) the correction of inconsistent edges. 
The first main step refers to an implementation of Eqs. (2) and (3) 
in which the intersection and union operations involving Gyt^^ are 

performed only for testable edges associated with non-zero entries 
of Tyk . As testable edges are determined from G^, a pre- 
processing step is performed to reduce errors in Gj}. The premise 
behind the pre-processing step is that an error unlikely affects the 
same edge, and that testable edges of any Gyk constitute only a 

small subset of edges in Gtj (i.e. the network is sparse). Following 
this premise, edges that appear in a majority of the accessibility 
matrices (above a certain threshold) are kept, but are otherwise 
removed. In our experience, a threshold of 65% gives a good and 
reliable performance, but any value between 50% to 80% works 
quite well in the case studies (see Results). A more detailed 
description of the pre-processing method can be found in text SI, 
while the filtering algorithm is provided in text S2. 

The schematic diagram of TRaCE with error correction is 
given in Fig. 3(b). We consider two sets of accessibility matrices; 
the first set comes from differential expression analysis (based on 
Zq's) and the second set comes from the transitive closure of the 
first set. We create the second set of matrices since the accessibility 
matrices identified from differential expressions may not satisfy the 



transitivity condition due to errors. The pre-processing step above 
is applied to both sets of matrices. Subsequently, two candidate 
upper bounds are generated using TRaCE with filtering. The 
upper bound obtained from the first set of accessibility matrices, 
denoted by G^ , is expectedly smaller (in size) than the bound from 
the second set, denoted by G^'- Note that ConTREx is only 
applicable to transitive digraphs, and therefore is applied only to 
the transitive closures (i.e. the second set). Using TRaCE with 
filtering, a candidate lower bound, denoted by Gf, is generated 
from the results of ConTREx. 

The last step in the procedure is to correct inconsistent edges, 
which is done by voting. For each inconsistent edge, we compared 
the number of times that the edge is present in the accessibility 
matrices and the ConTREx results (supporting the presence of the 
edge), with the number of times that the edge is absent from the 
accessibility and ConTREx matrices (supporting the absence of 
the edge). The upper bound is corrected (by addition of this edge) 
when the presence of the edge receives a (simple) majority vote. 
Vice versa, the lower bound is txjrrected (by removal of this edge) 
when the absence of the edge receives a majority vote. In the case 
of no majority vote, the edge is added to the upper bound and 
removed from the lower bound. The detail of the consistency 
check is described in text S3. As shown in Fig. 3(b), the consistency 
check and correction are first performed for the pair Gp and Gf , 
and subsequently the corrected lower bound, denoted by G2 > is 
compared with G2 to obtain the final corrected G^ and G^. 

Ranking of Edges from Ensemble Bounds. If desired, a 
ranked list of edges can be generated using tlu' lower and upper 
bounds of TRaCE in conjunction with the average z-scores for 
G^, i.e. Z(jj. Here, we carry out the ranking of regulatory edges in 
two phases. In the first phase, we rank subsets of edges according 
to the lower and upper bounds in the following order: edges in 
Si = G^, edges in 1.2 = G^- G^ , edges in Z3 = G^- G^, 
edges in £4 = G^ — G^ , and finally edges in S5 = (G^Y . In the 
second phase, we rank the edges within individual subsets 
according to the average z-scores. We implement the second 
phase by first computing the overall scores according to 



' (Zg ) + max Zkj 

f 1,1 (kJ)El.2 

{Zg„).+ max Zkj 
(Zg„).+ max Zkj 
(Zg„) . .+ max Zk,i 



if (v)e2i 
if (v)622 
if (v)eL3 
if (ij}el.4 

if (iV)6S5 



(V) 



Following the submission requirement of DREIAM 4 network 
inference challenge, we then assign a confidence score Xy to the 
edge (ij) according to: 



(8) 



maXijZij 



A score Xij of 1 reflects the highest confidence of the existence 
of an edge (ij), and vice versa a zero confidence score indicates 
certainty in the inexistence ofQJ). Finally, the ranked list of edges 
is generated by sorting the edges in decreasing order of confidence 
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Figure 5. Examples of the ensemble inference of random 
networks with 10 genes. In case I, the GRN has 8 edges and is 
inferable from the accessibility matrices and as few as three G{,}'s. In 
case II, the GRN has 13 edges and is not inferable. 
doi:1 0.1 371/journal.pone.01 0381 2.g005 
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Figure 7. Ensemble inference of £ coli GRN from error-free 
and the complete set of G{,)'s. The plot shows the network 
distances of the lower and upper bounds from as a function of the 
number of G{,}'s for the 50 most informative G{,}'s, i.e. the top 50 
highest number of testable edges. The incorporation of G{,}'s and G{,}'s 
was performed sequentially in decreasing number of testable edges. 
The Inset shows the result for the complete set of G{,}'s. 
doi:1 0.1 371/journal.pone.010381 2.g007 



scores. A similar procedure, called down-ranking, has been 
presented in Pinna et al. [23], where feed-forward edges are 
ranked lower than edges in the transitive reduction of the 
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Figure 6. Ensemble inference and inferability of 5,000 random 
scale-free networlcs of order (a) n = 10 and (b) n = 100 genes. The 

mean network distances of the lower and upper bounds from G^ are 
shown as a function of network size (i.e. number of edges). The error 
bars indicate the standard deviations. 
doi:1 0.1 371/journal.pone.01 0381 2.g006 



accessibility matrix. However, the down-ranking algorithm is 
described only for data of single-gene KO experiments. 

Results 

Inferability Analysis 

We first appKed TRaCE to error-free accessibility matrices of 
and Gyk by assuming ideal data (unbiased and error free) for 

the purpose of inferability analysis. Such an analysis is analogous 
to a priori identifiability analysis in the kinetic modeling of 
biological networks [12]. Here, we evaluated the network distances 
between the lower and upper bounds and the GRN, i.e. the 
numbers of edges in the set — G^ and G^ — G^, respectively. 

Random GRNs. We investigated the inferability of random 
GRNs of orders «= 10 and «= 100 genes. We set the network size 
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Figure 8. Ensemble inference of 5. cerevisieae GRN from error- 
free G^ and the 100 most informative (?{,'} 's based on the 
number of testable edges. The plot shows the network distances of 
the lower and upper bounds from Gf, as a function of the number of 

G{,|'s. The incorporation of G{,j's and Gj,j's was performed sequentially 
in decreasing number of testable edges. 
doi:1 0.1 371/journal.pone.01 0381 2.g008 
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Table 2. Ensemble Inference of £ coli subnetworks («= 100 genes). 





FPR 


FNR 


Before Correction 




After Correction 










m(G^-G") 


ra(G'--G,) 


m(G^-G'') 


ra(G'--Gp) 




0.00 


0.00 


0 


0 


0 


0 


122.62 


0.00 


0.10 


183.4 


35.86 


3.7 


1.88 


118.2 


0.00 


0.20 


191.1 


67.36 


16.56 


8.7 


113.52 


0.10 


0.00 


0 


1412.42 


0 


0.24 


150.4 


0.10 


0.10 


183.02 


1441.68 


2.72 


1.42 


146.3 


0.10 


0.20 


191 


1460.82 


12.06 


2.1 


146.74 


0.20 


0.00 


0 


2324.46 


0 


0.18 


213.42 


0.20 


0.10 


1 84.02 


2354.16 


2.72 


0.46 


197.28 


0.20 


0.20 


191 


2386.88 


9.94 


1.62 


210.12 



The reported values represent the averages over 50 subnetworks. FPR (FNR) is the ratio between the number of FP (FN) in the accessibility matrices and the number of 
edges In G^. Let m(A — B) of any two digraphs A and B denote the number of edges in the set EiA) — E(B). 
doi:1 0.1 371 /journal.pone.Ol 0381 2.t002 



(i.e. number of edges) between 0 and 3n randomly with equal 
probability, and assigned the edges without any preference. The 
upper size limit of 3« was chosen based on the ratio between the 
number of edges and the number of nodes in E. coli and yeast 
GRNs [24]. For each random network, we generated M+1 
accessibility matrices associated with and G{,} for every 
ieViG^). These accessibility matrices correspond to performing 
the full set of single- and double-gene KO experiments. 

We applied TRaCE without error correction to construct the 
ensemble lower and upper bounds for each random network using 
the aforementioned accessibility matrices. The mean network 
distances of the bounds from Gfs are shown in Figs. 4(a) and (b) as 
a function of network size. Here, we plotted the network distances 
of the lower bound using negative numbers and those of the upper 
bound using positive numbers. By doing so, we could illustrate the 
distance between the lower and upper bounds in the same plot. In 
particular, the number of edges in the set — G^ is equal to the 
distance between the two network distance curves in Fig. 4. Not 
surprisingly, the network distance increased with the size of the 
networks, i.e. larger networks are more difficult to infer than 

Table 3. Ensemble Inference of £ coli GRN. 



smaller networks. The difference between the lower and upper 
bounds also broadened with network size, indicating higher 
network uncertainty in the inference of larger GRNs. For networks 
containing fewer edges than nodes, the GRN G^ could generally 
be recovered from and G|,|'s. Nevertheless, Fig. 4 demon- 
strated that the GRNs were typically (64% for 10 gene networks 
and 76% for 100 gene networks) not inferable, since the lower and 
upper bounds did not converge. 

Fig. 5 shows two examples of GRN inference of order « = 1 0 
genes. In the first case (case l,m= 8 edges), Gfs could be recovered 
from Gfi and as few as 3 G{,j's, while in the second case (case II, 
»7=13 edges), the inference problem was underdetermined. 
Moreover, the results suggested that G{,}'s were not equally 
informative, as the reduction in the distance between the lower 
and upper bounds by incorporating an additional G^,} was not 
vuiiform. 

Random scale-&ee GRNs. Many cellular networks have 
been shown to be scale-free with a power-law degree distribution 
[25], where the majority of the nodes have low degrees (1 to 2) and 
a few nodes (called hubs) are of high degrees. We also tested the 
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0.00 
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0.00 
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3746 
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3092 
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0.00 
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34796 
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4636 


0.10 


0.10 


3573 


35566 


14 
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4624 


0.10 


0.20 


3746 
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4556 


0.20 


0.00 
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62035 


0 
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14933 


0.20 


0.10 


3554 
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14773 


0.20 


0.20 


3741 
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40 
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14392 



FPR (FNR) is the ratio between the number of FP (FN) in the accessibility matrices and the number of edges in G,5. Let m(A — B) of any two digraphs A and B denote the 
number of edges in the set E(A) — E{B). 
doi:1 0.1 371 /journal.pone.Ol 0381 2.t003 
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Table 4. Ensemble Inference of S. cerevisiae GRN. 
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0.10 
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198762 


0.10 
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198883 
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0.00 
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0.20 


0.10 
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0.20 


0.20 
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8 


2 


260313 



Let m(A — B) of any two digraphs A and B denote the number of edges in the set E{A) — E(B). 
doi:1 0.1 371 /journal.pone.Ol 0381 2.t004 



1565 genes and 3758 edges, while the yeast GRN comprise 4441 
genes and 12873 edges. For E. coli, we generated the accessibility 
matrices of and all G{,}'s. To reduce computational 

complexity, in the case of yeast, we used only the 100 most 
informative G{,}'s based on the number of testable edges (i.e. the 
number of non-zero elements in the testability matrix T'f,} in Eq. 
(4)). The results are shown in Figs. 7 and 8. Not surprisingly, both 
E. coli and yeast GRNs could not be completely inferred from the 
above accessibility matrices. There was a diminishing return of 
information after about 25 and 50 G{,}'s for the inference of£. coli 
and yeast GRNs, respectively. 

Ensemble inference from errorneous accessibility 
matrices 

We evaluated the performance of TRaCE with error correction 
using E. coli GRN and subnetworks, as well as yeast GRN. False 
positive errors were simulated by randomly adding edges to the 
accessibility matrices, while false negatives were simulated by 
randomly removing edges from the accessibility matrices. The 
performance of error correction in TRaCE was judged by the 
number of erroneous edges that remained in the bounds after 
correction for different FP and FN rates (abbreviated as FPR and 
FNR, respectively), defined with respect to the size of Gij. 

E. coli GRNs. We first used TRaCE with error correction for 
the ensemble inference of 50 random subnetworks of £. coli GRN 
with « = 100 genes, generated using GeneNetWeaver [24]. The 
average number of edges was 192. As in the above case study, we 



Table 5. Ensemble inference of DREAM4 100-gene gold standard networks: single-gene KO dataset. 





Networl< 
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m(G,) 


m(G^-G'^') 


m(G'--G^) 


m(G'-'-G'-) 
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2.28 


176 


44 
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136 


2 
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1.96 


249 


122 
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123 


3 


0 


11.04 


195 


86 


21 


144 


4 


0 


9.69 


211 


91 


7 


208 


5 


0 


3.47 


193 


108 


12 


237 



FPR (FNR) is the ratio between the number of FP (FN) in the accessibility matrices and the number of edges in G^. Let m(Gj^) denote the number of edges In G0, and 
m(A — B) of any two digraphs A and B denote the number of edges in the set EiA) — E(B). 
doi:l 0.1 371/journal.pone.Ol 0381 2.t005 



performance of TRaCE using random scale-free networks. Here, 
we constructed two sets of 5000 scale-free GRNs with order «= 10 
and n = 100 genes using the Barabasi-Albert model [26]. Briefly, 
the GRNs were grown from a random seed network of small size 
(with 3 vertices) by sequentially adding nodes to the network. For 
each node addition, between 1 and 5 new edges were inserted to 
the network connecting the new node with existing ones, in a 
manner such that the degree distribution decayed exponentially. 
Again, for the purpose of inferabHiity analysis, we generated « + 1 
error-free accessibility matrices and all G{,}'s, equivalent to 
having ideal data from single- and double-gene KO experiments. 

We used the error-free implementation of TRaCE to construct 
the ensemble lower and upper bounds for each of the random 
scale-free GRNs. Fig. 6 shows the mean network distances of the 
bounds as a function of network size. Similar to the random 
GRNs, most (79% for 10 gene networks and 75% for 100 gene 
networks) scale-free GRNs were not inferable from single and 
double-gene KO experiments, as the ensemble lower and upper 
bound did not meet for the majority of the networks. The mean 
network distance of the lower and upper bounds again increased 
with network size. However, the inference of scale-free GRNs 
from the accessibility matrices Gjg and G{,|'s appeared to be more 
difficult than that of random GRNs, as suggested by the larger 
distances between the lower and upper bounds for scale-free 
GRNs than for random GRNs of the same size. 

E. coli and S. cerevisiae GRNs. Finally, we investigated the 
inferabUity of large, realistic GRNs of E. coli and S. cerevisiae 
available in GeneNetWeaver [24]. The E. coli GRN consists of 
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Figure 9. Comparison of TRaCE and top performing metfiods in DREAM4 100-gene networit inference subcfiaiienge: single-gene 
KO dataset. The error bars represent the standard deviations. Based on the AUROC values, TRaCE performed as well as the downranking method 
(p = 0.5) and GENIE3 (/; = 0.3), but better than TIGRESS (/; = 0.002). Based on the AUPR values, TRaCE performed as well as the downranking method 
(p = 0.5), but better than GENIE3 (y) = 0.03) and TIGRESS (;; = 0.01). The statistical significance was evaluated using two sample t-test. 
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created the accessibility matrices of Gfi and every G{,}. We 
subsequently contaminated these matrices with FP and FN errors 
at the specified rates without any preference. The accuracy of the 
lower and upper bounds constructed using TRaCE with and 
without error correction is summarized in Table 2. 

As in many scenarios above, none of the subnetworks was 
inferable. FP errors could be very eflFectively eliminated by error 
correction. FNs errors expectedly led to missing edges from the 
upper bound, as indicated by the number of edges of G(j that did 
not appear in G'^ (see m(G0 — G^) in Table 2). The error 
correction could not completely eliminate Type A errors, leading 
to erroneous edges that appeared in the lower bound but did 
not belong to G^ (see m(G^ — G^) in Table 2). A combination of 
FP and FN errors were more easily corrected than FN errors 
alone. While FNs were more difficult to eliminate than FPs, 
correcting FP errors tended to produce larger network ensembles 
than FNs, indicating higher network uncertainty (see 
m[G^ — G^) in Table 2). Nevertheless, even in the worst case 
(0% FP, 20% FN), roughly 90% of the errors in the lower and 
upper bounds could be removed by the error correction (compare 
wi(G^ — G^) + wi(G^ — G^) before and after correction). 

For the inference of E. coli GRN, we generated erroneous 
accessibility matrices G^ and the 100 most informative G{,}'s 
corresponding to the top 1 00 highest numbers of testable edges. 
The performance of TRaCE with error correction for different FP 
and FN rates is summarized in Table 3. In addition, the structural 
Hamming distances of the lower and upper bounds before and 
after correction are reported in tables SI and S2. As before, 



TRaCE with error correction could handle FPs more effectively 
than FNs, and a mixture of FP and FN errors in the accessibility 
matrices were more easily eliminated than FN alone. In the worst 
case (0% FP, 20% FN), more than 95% of the errors were 
corrected. The size of the ensemble also depended strongly on the 
FP errors, and at 20% FP, the number of edges between the lower 
and upper bound reached three times the size of the full GRN. 

S. cerevtsiae GRN. For yeast GRN, we generated erroneous 
G(j and the 100 most informative Gj/j's. The results of TRaCE 
with error correction using these accessibility matrices are 
summarized in Table 4. The performance of TRaCE here was 
notably better than the inference of E. coli GRN. In all cases, 
TRaCE could rectify almost all erroneous edges. However, the 
correction came at a price of high uncertainty, where the 
difiFerence between the lower and upper bounds exceeded 20 
times the number of edges in G^. Despite such high uncertainty, 
the gap between the bounds represented only 1.3% of the total 
possible edges. 

Ensemble inference from expression data 

We further evaluated the performance of TRaCE using in silica 
noisy gene expression data generated using GeneNetWeaver [24] . 
We simulated steady-state gene expression data using the same 
settings as those in DREAM4 1 00-gene in silica network inference 
subchallenge. The simulated data are available at http:/ /www. 
cabsel.ethz.ch/tools/trace or upon request. In the following case 
studies, we analyzed and converted the data into accessibility 
matrices following the procedure described in the Numerical 
Implementation section. Subsequently, we used the resulting 



Table 6. Ensemble inference of DREAM4 100-gene gold standard networks: single- and double-gene KO dataset. 





Network 


FPR 


FNR 


m(Ge) 








1 


0.02 


2.21 


176 


50 


5 


27 


2 


0.01 


1.91 


249 


137 


5 


37 


3 


0 


10.43 


195 


103 


15 


30 


4 


0 


9.31 


211 


118 


8 


35 


5 


0.01 


3.40 


193 


131 


12 


21 



FPR (FNR) is the average ratio between the number of FP (FN) in the accessibility matrices and the number of edges in G^. Let m(G0) denote the number of edges in 
G^, and m(A — B) of any two digraphs A and B denote the number of edges in the set E(A) — E{B). 
doi:l 0.1 371 /journal.pone.01 0381 2.t006 
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1 2 3 4 5 avg 1 2 3 4 5 avg 

Network Network 



Figure 10. Comparison of TRaCE and top performing methods in DREAiVI4 100-gene networl< inference subclialienge: single- and 
double-gene KO dataset. The error bars represent the standard deviations. Based on the AUROC values, TRaCE performed as well as GENIE3 
(p = 0.8) and better than TIGRESS = 0.0005). Similarly, based on the AUPR values, TRaCE performed better than GENIE3 (y; = 0.04) and TIGRESS 
(p = 0.01). The statistical significance was evaluated using two sample t-test. 
doi:1 0.1 371 /journal.pone.01 0381 2.g010 



accessibility matrices with TRaCE to produce the ensemble lower 
and upper bounds. For the purpose of comparison with existing 
network inference methods, we also ranked the gene regulatory 
interactions according to their confidence scores. In particular, we 
compared the rankings with those from top performing inference 
methods in DREAM4, namely the downranking method by Pinna 
et al. [27], GENIE3 [28] and TIGRESS [29]. As mentioned 
earlier, the downranking method follows a two-phase procedure in 
ranking edges similar to our implementation, but the method only 
down-ranks feed-forward edges. On the other hand, GENIE3 uses 
a machine learning strategy called random forest, and TIGRESS 
is a regression-based method. For each method, we calculated the 
area under receiver operating characteristics (AUROC) and 
precision-recall (AUPR) using a redefined confusion matrix, in 
which methods were not penalized for any error within the set of 
non-inferable edges. We define non-inferable edges as edges 
belonging to the upper bound that are missing from the lower 
bound (i.e. all edges in the set - G\ which are determined 
from error-free accessibility matrices of the gold standard GRNs. 
More details of the calculation of the AUROC and AUPR can be 
found in a recent publication [30]. 

DREAM4 in silico network inference 100-gene 
subchallenge. We first simulated 5 replicates of steady state 
gene expression data associated with the complete single-gene KO 
experiments. We then used the data to construct the accessibility 
matrices of the gold standard GRNs. In this case, the upper bound 
of the ensemble was simply given by the accessibility matrix, and 
the lower bound was the ConTREx of the upper bound. Table 5 
shows the FPR and FNR in the accessibility matrices, the errors in 



the lower and upper bounds imi^Gfi— G^) and m[G^ — Gfi)), 
and the size of the ensemble [m{G^ — G^)) constructed using 
TRaCE. We noted that the majority (90%) of FN errors in the 
accessibility matrices were associated with fan-in motifs, in which a 
gene was regulated by several genes. In such a case, the effect of 
knocking-out one of the regulator genes could be compensated by 
the others and thus, the KO experiment did not show any 
significant differential expression of the downstream genes. As FNs 
affected the accessibility matrices, the errors in the upper bound 
m{^G^— G^) were higher than those in the lower bound 
m(G^ — G^). 

Subsequently, we created a ranked list of gene regulatory 
predictions and compared the list against those produced by the 
downranking method, GENIE3 and TIGRESS. Fig. 9 provides 
the comparison of AUROC and AUPR of the four methods. The 
comparison showed that TRaCE and the downranking method 
outperformed GENIE3 and TIGRESS, especially considering the 
AUPR values. Here, TRaCE performed as well as the down- 
ranking method, which was the best overall performer in DREAM 
4 100-gene network inference subchallenge [27]. 

Ensemble inference from single- and double-gene 
Kos. We further simulated 5 replicates of steady state gene 
expression data for the complete set of single- and double-gene 
KO experiments using the gold standard GRNs of DREAM4 100- 
gene subchallenge. We processed the data to obtain the 
accessibility matrices of G^ and all G{,}'s. We subsequentiy 
applied TRaCE with error correction to the accessibility matrices 
to obtain the ensemble lower and upper bounds, which are 



Table 7. Effects z cutoff and z,i,resiwid on AUROC and AUPR of TRaCE in DREAM4 100-gene subchallenge: single gene KO data. 





Ztbiesliald 


Zcinoff 


AUROC 


AUPR 


1.5 


2 


0.871 ±0.0616 


0.3085 ±0.1 41 6 


1.5 


3 


0.8723 ±0.0673 


0.3209 ±0.1 534 


2 


2 


0.8756 ±0.0595 


0.2922 ±0.1 54 


2 


3 


0.8758 ±0.0642 


0.2889 ±0.1 596 


2.5 


3 


0.8758 ±0.0642 


0.2438 ±0.1 545 



The AUROC and AUPR values are the average ± standard devation over 5 gold standard networks. The values used in the comparison with existing methods are 

highlighted in bold. 

doi:l 0.1 371/journal.pone.Ol 0381 2.t007 
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Table 8. Effects of z„„„ff, z,hresiwid and the preprocessing threshold of on AUROC and AUPR of TRaCE in DREAM4 100-gene 
subchallenge: single- and double-gene KO data. 





Zihveshold 


Zvmoff 


threshold 


AUROC 


AUPR 


1.5 


2 


0.5 


0.8723 ±0.0595 


0.4327±0.1931 






0.65 


0.871 8 ±0.0588 


0.41 97 ±0.1 959 






0.8 


0.8724 ±0.0600 


0.41 84 ±0.2022 


1.5 


3 


0.5 


0.8728 ±0.0596 


0.4204 ±0.1 986 






0.65 


0.8733 ±0.0610 


0.421 2 ± 02059 






0.8 


0.8735 ±0.061 7 


0.4201 ±02100 


2 


2 


0.5 


0.8726 ±0.0603 


0.4177±0.2158 






0.65 


0.8727 ±0.0607 


0.4167±02170 






0.8 


0.8728 ±0.0608 


0.4091 ±02139 


2 


3 


0.5 


0.8735 ±0.061 8 


0.41 70 ±021 60 






0.65 


0.8734 ± 0.0620 


0.4086 ±0.21 49 






0.8 


0.8698 ±0.0604 


0.4045 ±02 108 


2.5 


3 


0.5 


0.8695 ± 0.0599 


0.3893 ±0.2054 






0.65 


0.8696 ±0.0598 


0.3837 ±0.1 975 






0.8 


0.8695 ± 0.0600 


0.3764 ± 02023 



The AUROC and AUPR values are the average ± standard devation over 5 gold standard GRNs. The values used in the comparison with existing methods are 

highlighted in bold. 

doi:1 0.1 371 /journal.pone.Ol 0381 2.t008 



summarized in Table 6. Tlie average FPR and FNR were .similar 
to the single-gene KO data since both datasets had the same 
number of replicates. Again, the majority (80%) of errors in the 
accessibility matrices were associated with fan-in motifs. By 
comparing Tables 5 and 6, the errors in the lower bounds 
improved slighdy in comparison with those from only single-gene 
KO dataset (compare m(G^— G^) values). However, the errors 
in the upper bounds increased due to the accumulation of FN 
errors from fan-in motifs (compare m(G^— G^) values). Never- 
theless, the additional data from double-gene KO experiments led 
to lower network uncertainties (compare m(G^ — G^) values). 

For each gold standard GRN, we also generated a ranked list of 
edges based on the confidence scores and compared the list with 
those using GENIE3 and TIGRESS. The downranldng method 
could not be applied to double-gene KO data and was left out 
from the comparison. The AUROC and AUPR of the three 
methods are compared in Fig. 10. The AUPR of TRaCE was 
higher than GENIE3 and TIGRESS. Meanwhile, the AUROC 
values were generally high for all three methods, with TIGRESS 
having the lowest value. In comparison to single-gene KO data, 
the inclusion of double-gene KO data led to no change in the 
average AUROC (/7 = 0.52, two sample t-test) and an increase in 
AUPR (/7 = 0.17, two sample t-test). Finally, as shown in Tables 7 
and 8, the AUROC and AUPR values of TRaCE were insensitive 
to ^cutoff and z,i„.eshokl used in the construction of the accessibility 
matrices, and the threshold value in the preprocessing of G^. 
Because of the trade-off between FPs and FNs when using different 
^cutoff and Ziiii-eshold, the maximum of AUROC and AUPR 
corresponded to intermediate values within the selected ranges of 

^cutoff and Ztlji-eshold- 

E. coli and Yeast GRNs. Finally, we simulated the complete 
set of single-gene KO experiments for E. coli and yeast. For each 
organism, we generated 10 replicates of steady state gene 
expression data. We performed differential expression analysis 
using the Z-score transformation and obtained the accessibility 



matrices using either ,5 or 10 replicates. We then applied TRaCE 
with error correction to construct the ensemble lower and upper 
bounds, and created ranked hsts of edges as done earlier. The 
errors in the accessibility matrices and in the bounds are reported 
in Table 9, along with the AUROC and AUPR values. Here, FPR 
in the accessibility matrices decreased with increasing number of 
replicates, but FNR did not change with the number of replicates. 
The errors in the upper bounds changed little with increasing 
technical replicates from 5 to 10, but those in the lower bounds 
dropped considerably. The sizes of the ensemble also decreased 
with increasing replicates. Finally, the AUPR values improved 
slightly with higher replicates, while the AUROC values were 
insensitive with respect to the number of replicates. 

Discussion 

The inference of GRNs from data of gene perturbation 
experiments is an important but unsolved problem. The difficulty 
stems from the underdetermined nature of such an inference [7,9], 
as the data do not contain the necessary information to establish 
the complete causal interactions among the genes. Consequently, 
there exist many indistinguishable solutions. We have developed 
TRaCE with this consequence in mind by employing an ensemble 
inference strategy. Specifically, we have taken into consideration 
the fundamental limitation in using steady-state expression data of 
gene KO experiments for establishing direct causal relationships 
among genes. In TRaCE, we first transform the expression data 
into accessibility relationships (matrices) among genes. The novel 
contribution of TRaCE is an algorithm for the construction of 
lower and upper bounds of network ensemble, where each 
member of the ensemble satisfies the accessibility matrices. Edges 
of the upper bound that do not appear in the lower bound are 
considered non-inferable, as the existence of such edges can not be 
verified. Here, the size of the ensemble provides a metric of 
uncertainty in the network inference problem, with which the 
GRN inferability can be rigorously assessed. The GRN is inferable 
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Table 9. Ensemble Inference of £ coli and Yeast GRNs from single-gene KO data. 



Network 


Replicates 


FPR 


FNR 




m(G'--Gj) 


m(G''-G'-) 


AUROC 


AUPR 


E. coli 


5 


0.117 


2.61 


1611 


687 


1724 


0.8601 


0.4836 


E. coli 


10 


0.005 


2.61 


1612 


297 


1673 


0.8639 


0.5192 


Yeast 


5 


0.582 


24.97 


8101 


9159 


7907 


0.7699 


0.2464 


Yeast 


10 


0.141 


24.98 


8098 


3753 


7200 


0.7569 


0.2768 



FPR (FNR) is the ratio between the number of FP (FN) in the accessibility matrices and the number of edges in Gg. Let m(A — B) of any two digraphs yl and i? denote the 
number of edges in the set E(A) — E(B). 
doi:l 0.1 371/journal.pone.Ol 0381 2.t009 



when the lower and upper bounds coincide (i.e. the ensemble only 
contains one network). Thus, in TRaCE, the inference and 
inferability analysis are accomplished simultaneously. 

In the case studies, we have demonstrated the use of TRaCE for 
analyzing the inferability of GRNs. With the exception of networks 
of low order and small size, the majority of the GRNs were not 
inferable even when using error-free accessibility matrices. As we 
have used sparse networks in the case studies, the lower bound of 
the ensemble was a better estimate of the GRN than the upper 
bound. Finally, the majority of double-gene KO experiments were 
non-informative as the reduction in the size of the ensemble 
diminished after only a small number of G{,}'s. The observation 
above suggests that experimental design contributes significantly to 
the underdetermined nature of the typical GRN inference. In this 
regard, the lower and upper bounds of the ensemble could be used 
for optimizing the gene perturbation experiments, for example by 
finding the KO experiment that provides the maximum reduction 
in the difference between the lower and upper bounds. A strategy 
for optimal design of experiments using ensemble inference will be 
presented in a future publication. 

We have also used the ensemble lower and upper bounds in 
conjunction with the z-scores to produce a ranked list of gene 
regulatory predictions. In comparison with the top methods of 
DREAM4 network inference challenge, TRaCE could match the 
performance of the downranking method, the best overall 
performer in the 100-gene subchallenge. For single-gene KO 
dataset, TRaCE and the downranking method differed only for 
edges that were involved in cycles of more than 2 nodes. However, 
the two methods were fundamentally different, as TRaCE was 
developed for ensemble inference. Furthermore, the downranking 
method was created for single-gene KO experiments. Meanwhile, 
TRaCE significantly outperformed GENIE3 and TIGRESS when 
using single- and double-gene KO data. We note that GENIE3 
and TIGRESS were also among the best performers in DREAMS 
network inference challenge [10]. 

As expected, data noise negatively influenced the GRN 
inference and increased the uncertainty in the GRN inference. 
In the case studies, random errors in the accessibility matrices 
expectedly led to a larger ensemble. While the error correction in 
TRaCE was able to eliminate the majority of errors, some of the 
errors remained in the bounds. In particular, FN errors were 
harder to correct than FPs. The reason was that more Type A 
errors originating from FNs passed through the correction than 
those originating from FPs (see Fig. S3 and text S4). Meanwhile, 
FP errors could cancel out some Type A errors associated with 
FNs at the cost of increased uncertainty (see Fig. S4 and text S4). 

The application of TRaCE to simulated noisy gene expression 
data indicated that the majority of errors in the accessibility 
matrices were due to FNs associated with fan-in motifs in the 
GRN. In such motifs, the effects of knocking-out one regulator 



gene could be compensated by other regulator(s), and differential 
expression analysis could only reveal the dominant regulator(s) of a 
gene. Note that such a problem could not be improved by 
increasing technical replicates (see Table 9). Meanwhile, errors 
associated with fan-in motifs usually lead to type A errors where 
the affected edges are absent from the lower and upper bounds. 
However, if the available experiments permit the construction of 
Gy,,g in which Vko includes the dominant regulator(s) of a fan-in 
motif, then the related missing edge(s) may appear in Gy^^, 
leading to a detectable and correctable type C error. We expect 
the issue above would improve when using more sensitive 
measurements of gene expression, for example RNAseq. 

Conclusion 

Inferring gene regulatory networks from DNA microarray data 
is an unsolved problem. Community- wide assessments of inference 
methods have shown that distinguishing direct and indirect 
regulatory interactions among genes is a common AchiUes' heel 
of existing algorithms. In this study, we have adopted an ensemble 
inference strategy and develop new framework and algorithms for 
the creation of an ensemble of networks. Here, the ensemble 
represents the uncertainty associated with differentiating direct 
and indirect regulations using steady-state gene expression data of 
gene KO experiments. In particular, TRaCE produces the lower 
and upper bound of the ensemble. Using the bounds of the 
ensemble, a ranked list of gene regulatory predictions can also be 
generated. The case studies demonstrate that except for networks 
with few edges, most GRNs can not be fully inferred even when 
error-free data from complete single- and double-gene knock out 
experiments are available. In comparison with top performing 
methods of DREAM4 in silico network inference challenge, 
TRaCE performed equally well with the downranking method, 
the best overall performer in the challenge. However, the 
downranking method is not designed to handle data from multi- 
gene KO experiments. Meanwhile, TRaCE outperformed GE- 
NIE3 and TIGRESS when using single- and double-gene KO 
data. Nevertheless, the uncertainty in GRN inference is still 
significant, and systematic KOs of genes are often suboptimal as 
only a small fraction of the experiments are informative. We 
therefore hope that the shift of paradigm to ensemble inference 
will trigger further developments of methodologies for inferring 
and analyzing GRNs, in which network inference uncertainty is 
explicitiy taken into consideration. 

Supporting information 

Figure SI An example of ConTREx of a simple GRN. 

The directed edges indicated by blue arrows in are in the set of 
indirect regulations. The edges between A and B are retained. 
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because the cycle contains only two nodes. If the cycle had 
contained more than two nodes, all edges among the nodes would 
have been removed. 
(TIFF) 

Figure S2 Example of an ensemble involving GRN with 
cycle. Consider a GRN consisting of genes A, B and C, all of 

which are involved in a directed cycle. Further, let us assume that 
the edge C— > A belongs to the lower bound and A->C is not in the 
the upper bound. In this case, the ensemble comprises the graphs 
shown in (a)-(d). 
(TIFF) 

Figure S3 Type A errors due to an FN. (a) Gjj; (b) Gjj; (c) 
(d) G{B\'^ i^) ^{B} ^'^^^ '^'^ '^^ C->D. In this case, 
G|^j = G^j. (f) G||j with an FN at C^D and an FP at C^A (g) 
ConTREx reduction of (f). (h) G^j with an FN at C^^D and an 
FP at C-^E (i) ConTREx reduction of (h). See text S4 for details. 
(TIFF) 

Figure S4 Type A error due to an FP. G^ and G^ are shown 

in Fig. S3 (a) and (b), respectively, (a) G{c}. In this case. 



G{c}- (b) G^jwith an FP at B^D. Here, G; 



{cy 



(TIFF) 

Table SI Performance of TRaCE on inference of £. coli 
subnetworks (#1 = 100 genes). The reported values represent 
the average over 50 subnetworks. Let D(A — B) of any two 
digraphs A and B denote the structural Hamming distance (SHD) 



between them. The SHD is defined as the number of edges which 
difier or have opposite orientation between two networks [31]. 
(PDF) 

Table S2 Performance of TRaCE on inference of E. coli 

GRN. Let D(A — B) of any two digraphs A and B denote the 

structural Hamming distance between them. 

(PDF) 

Text SI Preprocessing of Gjj. 

(PDF) 

Text S2 Filter Algorithms. 

(PDF) 

Text S3 Consistency check. 

(PDF) 

Text S4 Type A errors due to FN. 

(PDF) 
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